Abstract

The S1 and S2 subunits of the spike glycoprotein of the coronavirus responsible for severe acute respiratory
syndrome (SARS) have been modelled, even though the corresponding amino acid sequences were not suitable for
tertiary structure prediction with conventional homology and/or threading procedures. An indirect search for a
protein structure to be used as a template for 3D modelling was performed on the basis of the similarity in
genomic organisation generally exhibited by coronaviruses. The crystal structure of Clostridium botulinum
neurotoxin B appeared to be structurally adaptable to human and canine coronavirus spike protein sequences, and
it was successfully used to model the two subunits of the SARS coronavirus spike glycoprotein. The overall shape
and the surface hydrophobicity of the two subunits in the obtained models suggest the localisation of the
regions most relevant to their activity.

© 2003 Elsevier Inc. All rights reserved. 
The recent outbreak of an atypical pneumonia,
known as “severe acute respiratory syndrome” or
SARS, has forced the scientific community to look for
strategies to defeat the infective agent. In this respect,
the identification of a coronavirus as the agent responsible
for the infection and its genomic characterisation [1,2]
represented the first milestone in the race to limit the
worldwide diffusion of SARS infection.

The second milestone would be the development of
efficient vaccines and/or specific antiviral drugs and,
in the post-genomic era, bioinformatics tools will play
a primary role in providing a rational basis for the
development of anti-SARS molecular weapons.

So far, the search for drugs against SARS coronavirus,
SARS_CoV, has been carried out on the basis of its
putative receptor and, in this respect, human
aminopeptidase N, or CD13, has been proposed [3]. Thus,
inhibitors of this enzyme have been suggested for
prophylaxis and therapy [4], even though some
opposing data have been reported [5].

This type of approach to limit the viral infection should 
lead, hopefully soon, to an unambiguous assignment of 
the cell receptor, but alternative strategies should also be 
found to stop the diffusion of SARS_CoV. 

In this respect, new specific antivirals, designed on the
basis of structural knowledge of SARS_CoV proteins,
cannot be synthesised yet since, in spite of the already
abundant genomic information, only one crystal structure
[6] and one model [7] of the viral proteins are
available so far. The principal reason for this apparent
contradiction between the abundance of experimental
data and the lack of predicted 3D molecular models is
the very poor sequence homology between SARS_CoV
proteins and those available in the Protein Data Bank [8].
Here, we present the modelling of the S1 and S2
subunits of the spike glycoprotein of SARS_CoV, a
highly antigenic viral envelope protein involved in
host cell infection. The spike glycoprotein is translated
as a large polypeptide that is subsequently cleaved by
virus-encoded or host-encoded proteases to produce,
as in other coronaviruses, two functional subunits, S1
and S2 [9]. It has also been shown that S1 is the
peripheral fragment and S2 is the membrane-spanning
fragment [10]. Both chimeric S proteins appeared to
cause cell fusion when expressed individually, suggesting
that they were biologically fully active [9]. The spike is
believed to be a type I transmembrane protein, with
N-terminal ectodomains and a C-terminal hydrophobic
anchor, and with an unusual cysteine-rich domain that
bridges the putative junction of the anchor and the
cytoplasmic tail [9]. The type I glycoprotein S of
coronaviruses, whose trimers constitute the typical viral
spikes [10], is assembled into virions through non-covalent
interactions with the M protein. In the SARS spike
glycoprotein, the S2 domain can be easily identified by
sequence alignment with other coronavirus proteins. As
far as the remaining part of the S protein is concerned,
the same procedure does not yield any information,
suggesting that SARS_CoV has developed a new type of
S1 domain which is not conserved among the other
members of the S1 family.

As in the case of other viral membrane proteins, the S
glycoprotein of SARS_CoV should play an important
role in the interaction of the virus with its host cell
receptor, having a primary role in eliciting antibodies in
the host species [3]. Thus, it is apparent that knowledge
of the 3D structure of SARS_CoV S protein,
representing a critical toxic site and a candidate protective
antigen, would be of great value in the search for a
vaccine, in explaining existing data, and in designing
novel diagnostic kits and antiviral drugs.

Materials and methods 

A set of 14 different genomic sequences of SARS_CoV was collected
from the NCBI [11] databases. The genomic regions coding for the
spike glycoprotein were identified in these sequences with an
ORFfinder search [11]. The corresponding amino acid sequences were
aligned with ClustalW v. 1.7 [12], giving 99-100% pairwise sequence
identities. A consensus sequence obtained from this alignment was then
used to find a subset of homologous S glycoprotein sequences of other
coronaviruses using Psi-Blast v. 2.1 [11]. Some of these sequences were
chosen for secondary structure prediction and fold recognition studies,
using PsiPred v.2.1 [13] and GenTHREADER [14], respectively. With the
fold recognition method, the S glycoprotein of Human Coronavirus
(SwissProt Accession No. SP_36334) [15] identifies the 1G9D pdb entry
[16], while the S protein of Canine Coronavirus (SwissProt Accession No.
SP_36300) [15] identifies the 1G5G pdb entry [16]. For each hit to a
particular PDB entry, a SCOP v. 1.63 [17] classification of names and
numbers was made. Model building of the S1 and S2 subunits of SARS S
glycoprotein was carried out with the ClustalW [12] alignment between
the target sequence and that of the template structure. Models were
subsequently optimised according to secondary structure predictions.
Substitution of amino acid residues and modelling of insertions and
deletions in the target structure were performed using SwissPDBViewer
software [18]. The 3D models were optimised by a 900 step minimisation
run with AMBER [19] and finally validated with the PROCHECK v. 3.5.4
procedure [20]. Surface accessibility of the potential glycosylation
sites, mutation distribution, and localisation of CD13 binding regions
were analysed by using MOLMOL software [21]. With the same software,
possible disulphide bridge formations were investigated by considering
a threshold distance of 10 Å between Cα atoms of cysteine pairs,
after the prediction of the bonding/non-bonding state of all the
cysteines present in the S1 and S2 sequences [22]. The hydropathy
profile of the protein was calculated according to Kyte and Doolittle
plots [23] and hydrophilic/hydrophobic potentials were calculated with
GRID [24], using a grid resolution of 1 Å and the standard “OH2” and
“DRY” probes.
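
The 10 Å Cα-Cα screening criterion for cysteine pairs can be illustrated with a minimal Python sketch. It is not the MOLMOL procedure used here; it only assumes a model written as standard fixed-column PDB ATOM records, and the function names cys_ca_coords and candidate_bridges are hypothetical.

```python
import math
from itertools import combinations


def cys_ca_coords(pdb_text):
    """Return {residue_number: (x, y, z)} for the CA atom of every CYS residue."""
    coords = {}
    for line in pdb_text.splitlines():
        # Fixed-column ATOM record: atom name in columns 13-16,
        # residue name in 18-20, residue number in 23-26, x/y/z in 31-54.
        if (line.startswith("ATOM")
                and line[12:16].strip() == "CA"
                and line[17:20] == "CYS"):
            resnum = int(line[22:26])
            coords[resnum] = (
                float(line[30:38]),
                float(line[38:46]),
                float(line[46:54]),
            )
    return coords


def candidate_bridges(coords, threshold=10.0):
    """List cysteine pairs whose CA-CA distance is within the threshold (in Å)."""
    pairs = []
    for a, b in combinations(sorted(coords), 2):
        distance = math.dist(coords[a], coords[b])
        if distance <= threshold:
            pairs.append((a, b, round(distance, 2)))
    return pairs
```

Passing the text of a model file to cys_ca_coords and the result to candidate_bridges would list every cysteine pair that satisfies the distance criterion; deciding which of those pairs are actually bonded still requires the separate bonding-state prediction [22].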

Results and discussion 

The phylogenetic study for the structural proteins E, 
M, S, and the proteinase [1] indicates that the SARS 
coronavirus, even though it is most similar to group II 
of coronaviruses, which includes bovine and murine 
coronaviruses, should rather be considered as the first 
member of a new group IV. The sequence homology of
the S glycoprotein of SARS_CoV with those of other
coronaviruses, ranging from 20.39% to 27.63%, is rather
low.

The genomic drift effects on the S glycoprotein have
been analysed by aligning the available data. The
alignment of 14 different sequences obtained with
ClustalW showed 99-100% identity. It is interesting to
note that most of the amino acid mutations in SARS_CoV S
glycoprotein sequences are located in the S1 subunit
(residues 1-680).
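
As an illustration of how mutation sites can be read off such an alignment, the following Python sketch lists the columns at which aligned sequences differ and splits them at the S1 boundary (residue 680). The toy sequences are hypothetical, the function names are not from any package, and column numbers equal residue numbers only when the alignment contains no gaps.

```python
def variable_positions(aligned):
    """Return 1-based alignment columns where the sequences do not all agree."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all sequences must have the same aligned length")
    return [i + 1 for i in range(length) if len({seq[i] for seq in aligned}) > 1]


def split_by_subunit(positions, s1_end=680):
    """Split positions into those inside S1 (1..s1_end) and those beyond it."""
    inside = [p for p in positions if p <= s1_end]
    beyond = [p for p in positions if p > s1_end]
    return inside, beyond


# Toy alignment (hypothetical sequences, not SARS_CoV data)
toy = ["MKTLLA", "MKTLLA", "MRTLLV"]
print(variable_positions(toy))
```

Note that a gap character in any sequence is counted as a difference in this sketch.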

Since tertiary structure prediction by sequence homology
and by threading algorithms for fold recognition
using SARS_CoV S glycoprotein sequences
could not yield reliable molecular models, the
molecular models of the S1 and S2 subunits
of SARS_CoV S glycoprotein in the present report have
been obtained in an indirect way. A preliminary sequence
alignment of various coronavirus S glycoproteins was
performed and a fold recognition analysis was carried out
on the aligned sequences. Among all the aligned S
glycoprotein sequences, one from human coronavirus
(SwissProt Accession Number SP36334) and one from
canine coronavirus (SwissProt Accession Number SP36300)
were chosen for fold recognition studies using
GenTHREADER v2.1 [14]. The GenTHREADER results for
human and canine CoV S proteins indicated two pdb
entries, i.e., 1G9D of neurotoxin B of Clostridium
botulinum and 1G5G of the Tetanus toxin, as possible
templates for structure prediction. After sequence
alignment of SARS S protein (SwissProt Accession Number
P59594) with the corresponding ones from human and canine
coronaviruses, model building of the S1 and S2 subunits
was carried out by using the latter two pdb files and
shuffled PsiPred runs [13], with subsequent manual
optimisation to enhance the overlap between the
predicted and observed secondary structure elements.

Using the structure of Tetanus toxin as a template
allowed only the carboxy-terminal region to be modelled,
and only by inserting long gaps. By contrast, as shown in
Fig. 1, with the other candidate template, i.e., the C.
botulinum toxin, a good level of similarity was achieved,
with 15% identity, 49% positives, and 5% gaps.
Consequently, reliable 3D models of the S1 and S2 subunits
were obtained, see Fig. 2, as the small insertions which
still had to be made did not cause major problems in the
molecular modelling procedure. In addition, the
secondary structure prediction [13] for both subunits and
the template structure indicated that the distribution of
α-helix, β-sheet, and random coil elements was
comparable and located in the same regions.
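
For readers who wish to reproduce statistics of this kind, the sketch below shows one common way to compute identity, positives, and gap percentages from a gapped pairwise alignment, taking the alignment length as the denominator. The original tool's exact convention is not stated, so this is an assumption; "positives" normally requires a substitution matrix, which is represented here by an optional, user-supplied score function (hypothetical).

```python
def alignment_summary(target, template, score=None):
    """Percent identical, positive, and gapped columns of a pairwise alignment.

    target and template are equal-length strings with '-' marking gaps.
    score, if given, is a function (a, b) -> number; a column counts as
    positive when the residues are identical or score(a, b) > 0.
    """
    if len(target) != len(template):
        raise ValueError("aligned sequences must have equal length")
    columns = list(zip(target, template))
    identities = positives = gaps = 0
    for a, b in columns:
        if a == "-" or b == "-":
            gaps += 1
        elif a == b:
            identities += 1
            positives += 1
        elif score is not None and score(a, b) > 0:
            positives += 1
    n = len(columns)
    return {
        "identity": 100.0 * identities / n,
        "positives": 100.0 * positives / n,
        "gaps": 100.0 * gaps / n,
    }
```

Without a score function the positives value simply equals the identity value.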

In particular, the SARS S1 subunit has been modelled
between residues 17 and 680, that is, between the
signal peptide and the S2 subunit, and the obtained
structure has been deposited in the Protein Data Bank
with the pdb ID 1Q4Z. In good analogy with the template
fold of the corresponding domain, mainly anti-parallel β
sheets with other segregated α and β regions are
present in the 3D model.

As far as the S2 subunit is concerned, its outer
moiety has been modelled between residues 727 and
1195, just before the predicted transmembrane
segment, identified between residues 1200 and 1223
[16]. In the obtained model, deposited in the Protein
Data Bank with the pdb ID 1Q4Y, the three domains
which are present in the template structure [17] may be
clearly identified, see Fig. 2. Thus, domain I (residues
727-845), with a coiled coil fold, contains a set of five
anti-parallel helices; domain II (residues 846-1048),
with a concanavalin A-like sandwich fold, is
characterised by 12-14 strands in two sheets; and, finally,
domain III (residues 1049-1195) is a six-stranded
anti-parallel β barrel. A preliminary validation of this S2
model comes from the high surface exposure of the
putative CD13 binding sites [3], respectively located in
a helix of the first domain and in two loops of the
second and third domains.

Fig. 1. Sequence alignment between SARS_CoV S protein and
C. botulinum neurotoxin B (pdb ID 1G9D) used to construct the
model. The regions exhibiting the same secondary structure, as
predicted by PsiPred v.2.1, are also shown.

A series of considerations should be taken into
account to discuss the reliability of the predicted tertiary
structures of the S1 and S2 subunits of SARS_CoV S
glycoprotein. The first is that simple physicochemical
rules should be fulfilled by the modelled structures
and, in this respect, the Ramachandran plots and the
values of the corresponding G factors are excellent, as
-0.3 and -0.2 were obtained, respectively, for the S1
and S2 subunits [20]. Furthermore, the fact that
hydrophobic residues should be mostly buried in the
molecular core, with the hydrophilic ones located in
surface-exposed protein regions, should be considered. This
condition seems to be met by our models, as a good
agreement between hydropathy profiles [24] and residue
accessibility can be observed, see Fig. 3.
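
A hydropathy profile of the kind compared with residue accessibility can be sketched in Python as a sliding-window average over the Kyte and Doolittle scale. The window length of 9 residues is an assumption (the value used for Fig. 3 is not stated), and hydropathy_profile is a hypothetical helper name.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=9):
    """Return (centre_position, mean_hydropathy) for each full window.

    Positions are 1-based and refer to the central residue of the window;
    the window length must be odd so that a central residue exists.
    """
    if window % 2 == 0 or window > len(seq):
        raise ValueError("window must be odd and no longer than the sequence")
    values = [KD[aa] for aa in seq.upper()]
    half = window // 2
    return [
        (i + 1, sum(values[i - half:i + half + 1]) / window)
        for i in range(half, len(values) - half)
    ]
```

Plotting these values against a per-residue exposed surface area, averaged over the same window, gives a comparison of the type shown in Fig. 3; buried hydrophobic stretches should then appear as hydropathy maxima coinciding with accessibility minima.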

Fig. 2. Surface and ribbon representations of the tertiary structure of
the S1 and S2 subunits of SARS_CoV S glycoprotein. In the S1
representation (left) the residues forming the hydrophobic cluster are
highlighted. Domains I, II, and III of the S2 subunit are shown in red,
green, and yellow, respectively (right), together with the putative
CD13 binding regions. (For interpretation of the references to colour
in this figure legend, the reader is referred to the web version of this
paper.)

In addition to the above-mentioned quality controls
of our predicted structures, some of which were routinely
performed upon structure deposition in the Protein
Data Bank, another set of criteria to validate the
molecular models of the S1 and S2 subunits of
SARS_CoV S protein comes from the consistency
between the structural features of
the two models and some common experimentally
derived findings, such as glycosylation site distribution and
disulphide bridge formation.

As far as the N/O-glycosylation regions are
concerned, these are expected to be surface exposed. In the
S1 subunit, only three of the 14 predicted
glycosylation sites [16] exhibit low surface accessibility,
and all five possible glycosylation sites in the S2
subunit are exposed.

Disulphide bridge formation is a critical
point in assessing the reliability of a molecular model,
since cysteine residues have to be within a suitable
bonding distance to form disulphide bridges.
In the S1 subunit there are 20 cysteines,
while in the S2 subunit there are eight in the external
portion and nine in the internal portion [11] of the C terminus.
The latter cysteines seem not to have a bridging role, as
they are predicted to be in a non-bonding state by neural
network predictions [22] and, in any case, do not belong
to the presently modelled part of the molecule.

Fig. 3. Hydropathy vs. residue exposed surface area (ESA) calculated for the models of S1 (top) and S2 (bottom) subunits of SARS_CoV S
glycoprotein. In the graph hydrophobicity (light line) increases on descending the Y-axis, whereas residue accessibility (bold line) decreases.

It is worth noting that all of the possible cystine
bridges, 10 and four disulphide bridges, respectively, in
the S1 and S2 subunits, can be formed in the two
proposed models, since the corresponding cysteine
Cα atoms are always found within 10 Å of each other. In
particular, for the S1 subunit disulphide bridges are
predicted between C19-C128, C133-C467, C159-C288,
C278-C474, C323-C657, C348-C511, C366-C378,
C419-C603, C524-C576, and C635-C648, and in the
model of the S2 subunit disulphide bridges between
C731-C833, C742-C822, C1014-C1025, and C1064-C1108
can be obtained.
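
The internal consistency of this pairing list can be checked with a short Python sketch: 10 pairs should account for the 20 cysteines of S1 and four pairs for the eight external cysteines of S2, with no residue used twice and every residue inside the modelled range. The last S1 pair and the last S2 pair are read here as C635-C648 and C1064-C1108, which is an assumption about the printed numbering; the helper names are hypothetical.

```python
S1_BRIDGES = ("C19-C128, C133-C467, C159-C288, C278-C474, C323-C657, "
              "C348-C511, C366-C378, C419-C603, C524-C576, C635-C648")
# Last S2 pair assumed to read C1064-C1108
S2_BRIDGES = "C731-C833, C742-C822, C1014-C1025, C1064-C1108"


def parse_bridges(text):
    """Turn 'C19-C128, C133-C467' into [(19, 128), (133, 467)]."""
    pairs = []
    for item in text.split(","):
        first, second = item.strip().split("-")
        pairs.append((int(first.lstrip("C")), int(second.lstrip("C"))))
    return pairs


def each_cysteine_once(pairs):
    """True when no cysteine takes part in more than one bridge."""
    residues = [r for pair in pairs for r in pair]
    return len(residues) == len(set(residues))


def within_range(pairs, first, last):
    """True when every bridged residue lies in the modelled range."""
    return all(first <= r <= last for pair in pairs for r in pair)


s1 = parse_bridges(S1_BRIDGES)
s2 = parse_bridges(S2_BRIDGES)
print(len(s1), each_cysteine_once(s1), within_range(s1, 17, 680))
print(len(s2), each_cysteine_once(s2), within_range(s2, 727, 1195))
```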

The two molecular models were analysed in
terms of surface hydrophobic potential to identify
possible binding sites. According to the calculations
performed for the S1 subunit, a highly hydrophobic pocket
is outlined. In this molecular moiety six hydrophobic
residues, i.e., Phe187, Phe253, Phe334, Trp340,
Trp423, and Tyr677, contribute to the formation of an
extensive hydrophobic cluster, where long-range interactions
can contribute to the stability of the S1 structure. This feature,
not present in the template structure, should also
be of fundamental relevance in the S1 folding process
[25] and could represent a suitable target for antiviral
drugs.

In the case of the S2 subunit, the hydrophobic potential
analysis identifies two putative binding sites in the
Phe850-Phe870 and Phe1077-Phe1079 regions,
both located in the putative binding sites of CD13 [3].

As the analysis of the mutation distribution among
SARS_CoV proteins could be important to predict its
antigenic drift, this has been done by aligning the 14 S
protein sequences so far present in the NCBI database
for this virus [11], and 10 mutation sites were identified.
Seven of these mutations are in the S1 subunit; the ones
in positions 77 and 244 are both localised in α helices,
while the others, 239, 311, 344, 501, and 577, are in loop
regions. It is interesting to note that, apart from residue
positions 244 and 501, all the others are located in
well-exposed regions of the protein surface. High surface
exposure is also exhibited by the sites of the remaining
three mutations, i.e., 778, 794, and 1148, located in
domains I and II of the S2 subunit.
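
The mutation sites listed above can be tabulated with a few lines of Python to make the counts explicit. The table below simply transcribes the positions, structural context, and exposure stated in the text; the tuple layout and the summarise helper are illustrative choices, not part of the original analysis.

```python
from collections import Counter

# (position, subunit, structural context, well exposed) as reported in the text.
# The S2 sites are described only as highly exposed, so their context is None.
SITES = [
    (77, "S1", "helix", True),
    (239, "S1", "loop", True),
    (244, "S1", "helix", False),
    (311, "S1", "loop", True),
    (344, "S1", "loop", True),
    (501, "S1", "loop", False),
    (577, "S1", "loop", True),
    (778, "S2", None, True),
    (794, "S2", None, True),
    (1148, "S2", None, True),
]


def summarise(sites):
    """Return (sites per subunit, number of exposed sites, total sites)."""
    per_subunit = Counter(site[1] for site in sites)
    exposed = sum(1 for site in sites if site[3])
    return dict(per_subunit), exposed, len(sites)


print(summarise(SITES))
```

With the values as transcribed, seven sites fall in S1 and three in S2, and eight of the 10 sites are surface exposed.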

As a final remark, it should be underlined that the
consistent series of structural features exhibited by
the proposed models for the S1 and S2 subunits of the
SARS_CoV spike protein supports their reliability as a
possible rational starting point for antiviral drug
design.